Goto

Collaborating Authors

 candidate feature


AIMatDesign: Knowledge-Augmented Reinforcement Learning for Inverse Materials Design under Data Scarcity

Yu, Yeyong, Bian, Xilei, Xiong, Jie, Wu, Xing, Qian, Quan

arXiv.org Artificial Intelligence

With the growing demand for novel materials, machine learning-driven inverse design methods face significant challenges in reconciling the high-dimensional materials composition space with limited experimental data. Existing approaches suffer from two major limitations: (I) machine learning models often lack reliability in high-dimensional spaces, leading to prediction biases during the design process; (II) these models fail to effectively incorporate domain expert knowledge, limiting their capacity to support knowledge-guided inverse design. To address these challenges, we introduce AIMatDesign, a reinforcement learning framework that addresses these limitations by augmenting experimental data using difference-based algorithms to build a trusted experience pool, accelerating model convergence. To enhance model reliability, an automated refinement strategy guided by large language models (LLMs) dynamically corrects prediction inconsistencies, reinforcing alignment between reward signals and state value functions. Additionally, a knowledge-based reward function leverages expert domain rules to improve stability and efficiency during training. Our experiments demonstrate that AIMatDesign significantly surpasses traditional machine learning and reinforcement learning methods in discovery efficiency, convergence speed, and success rates. Among the numerous candidates proposed by AIMatDesign, experimental synthesis of representative Zr-based alloys yielded a top-performing BMG with 1.7GPa yield strength and 10.2\% elongation, closely matching predictions. Moreover, the framework accurately captured the trend of yield strength variation with composition, demonstrating its reliability and potential for closed-loop materials discovery.


Bayesian sparse modeling for interpretable prediction of hydroxide ion conductivity in anion-conductive polymer membranes

Murakami, Ryo, Miyatake, Kenji, Mahmoud, Ahmed Mohamed Ahmed, Yoshikawa, Hideki, Nagata, Kenji

arXiv.org Machine Learning

Their hydroxide ion conductivity varies depending on factors such as the type and distribution of quaternary ammonium groups, as well as the structure and connectivity of hydrophilic and hydrophobic domains. In particular, the size and connectivity of hydrophilic domains significantly influence the mobility of hydroxide ions; however, this relationship has remained largely qualitative. In this study, we calculated the number of key constituent elements in the hydrophilic and hydrophobic units based on the copolymer composition, and investigated their relationship with hydroxide ion conductivity by using Bayesian sparse modeling. As a result, we successfully identified composition-derived features that are critical for accurately predicting hydroxide ion conductivity. KEYWORDS anion-conductive polymer membranes; Materials informatics; Data-driven science; Sparse modeling; Bayesian inference 1. Introduction Anion-conductive polymer membranes are promising candidates for use as solid electrolytes in alkaline energy devices, such as fuel cells and water electrolysis cells. In particular, anion exchange membrane water electrolysis systems, which can produce green hydrogen efficiently by utilizing renewable energy sources, are being actively investigated worldwide as a core technology for realizing a carbon-neutral hydrogen society. For such applications, desirable properties of anion-conductive polymers include anion conductivity comparable to that of alkaline aqueous electrolytes, the ability to form thin membranes (thickness < 50µm) with sufficient mechanical strength, gasCONTACT Ryo Murakami.


Heterogeneous Random Forest

Kim, Ye-eun, Kim, Seoung Yun, Kim, Hyunjoong

arXiv.org Machine Learning

Random forest (RF) stands out as a highly favored machine learning approach for classification problems. The effectiveness of RF hinges on two key factors: the accuracy of individual trees and the diversity among them. In this study, we introduce a novel approach called heterogeneous RF (HRF), designed to enhance tree diversity in a meaningful way. This diversification is achieved by deliberately introducing heterogeneity during the tree construction. Specifically, features used for splitting near the root node of previous trees are assigned lower weights when constructing the feature sub-space of the subsequent trees. As a result, dominant features in the prior trees are less likely to be employed in the next iteration, leading to a more diverse set of splitting features at the nodes. Through simulation studies, it was confirmed that the HRF method effectively mitigates the selection bias of trees within the ensemble, increases the diversity of the ensemble, and demonstrates superior performance on datasets with fewer noise features. To assess the comparative performance of HRF against other widely adopted ensemble methods, we conducted tests on 52 datasets, comprising both real-world and synthetic data. HRF consistently outperformed other ensemble methods in terms of accuracy across the majority of datasets.


Distribution Guided Active Feature Acquisition

Li, Yang, Oliva, Junier

arXiv.org Artificial Intelligence

Human agents routinely reason on instances with incomplete and muddied data (and weigh the cost of obtaining further features). In contrast, much of ML is devoted to the unrealistic, sterile environment where all features are observed and further information on an instance is obviated. Here we extend past static ML and develop an active feature acquisition (AFA) framework that interacts with the environment to obtain new information on-the-fly and can: 1) make inferences on an instance in the face of incomplete features, 2) determine a plan for feature acquisitions to obtain additional information on the instance at hand. We build our AFA framework on a backbone of understanding the information and conditional dependencies that are present in the data. First, we show how to build generative models that can capture dependencies over arbitrary subsets of features and employ these models for acquisitions in a greedy scheme. After, we show that it is possible to guide the training of RL agents for AFA via side-information and auxiliary rewards stemming from our generative models. We also examine two important factors for deploying AFA models in real-world scenarios, namely interpretability and robustness. Extensive experiments demonstrate the state-of-the-art performance of our AFA framework.


Cost-constrained multi-label group feature selection using shadow features

Klonecki, Tomasz, Teisseyre, Paweł, Lee, Jaesung

arXiv.org Machine Learning

We consider the problem of feature selection in multi-label classification, considering the costs assigned to groups of features. In this task, the goal is to select a subset of features that will be useful for predicting the label vector, but at the same time, the cost associated with the selected features will not exceed the assumed budget. Solving the problem is of great importance in medicine, where we may be interested in predicting various diseases based on groups of features. The groups may be associated with parameters obtained from a certain diagnostic test, such as a blood test. Because diagnostic test costs can be very high, considering cost information when selecting relevant features becomes crucial to reducing the cost of making predictions. We focus on the feature selection method based on information theory. The proposed method consists of two steps. First, we select features sequentially while maximizing conditional mutual information until the budget is exhausted. In the second step, we select additional cost-free features, i.e., those coming from groups that have already been used in previous steps. Limiting the number of added features is possible using the stop rule based on the concept of so-called shadow features, which are randomized counterparts of the original ones. In contrast to existing approaches based on penalized criteria, in our method, we avoid the need for computationally demanding optimization of the penalty parameter. Experiments conducted on the MIMIC medical database show the effectiveness of the method, especially when the assumed budget is limited.


CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

Neehal, Nafis, Wang, Bowen, Debopadhaya, Shayom, Dan, Soham, Murugesan, Keerthiram, Anand, Vibha, Bennett, Kristin P.

arXiv.org Artificial Intelligence

CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. "ListMatch-LM" and "ListMatch-BERT" use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.


FeatNavigator: Automatic Feature Augmentation on Tabular Data

Liang, Jiaming, Lei, Chuan, Qin, Xiao, Zhang, Jiani, Katsifodimos, Asterios, Faloutsos, Christos, Rangwala, Huzefa

arXiv.org Artificial Intelligence

Data-centric AI focuses on understanding and utilizing high-quality, relevant data in training machine learning (ML) models, thereby increasing the likelihood of producing accurate and useful results. Automatic feature augmentation, aiming to augment the initial base table with useful features from other tables, is critical in data preparation as it improves model performance, robustness, and generalizability. While recent works have investigated automatic feature augmentation, most of them have limited capabilities in utilizing all useful features as many of them are in candidate tables not directly joinable with the base table. Worse yet, with numerous join paths leading to these distant features, existing solutions fail to fully exploit them within a reasonable compute budget. We present FeatNavigator, an effective and efficient framework that explores and integrates high-quality features in relational tables for ML models. FeatNavigator evaluates a feature from two aspects: (1) the intrinsic value of a feature towards an ML task (i.e., feature importance) and (2) the efficacy of a join path connecting the feature to the base table (i.e., integration quality). FeatNavigator strategically selects a small set of available features and their corresponding join paths to train a feature importance estimation model and an integration quality prediction model. Furthermore, FeatNavigator's search algorithm exploits both estimated feature importance and integration quality to identify the optimized feature augmentation plan. Our experimental results show that FeatNavigator outperforms state-of-the-art solutions on five public datasets by up to 40.1% in ML model performance.


Automated Feature Selection for Inverse Reinforcement Learning

Baimukashev, Daulet, Alcan, Gokhan, Kyrki, Ville

arXiv.org Artificial Intelligence

Inverse reinforcement learning (IRL) is an imitation learning approach to learning reward functions from expert demonstrations. Its use avoids the difficult and tedious procedure of manual reward specification while retaining the generalization power of reinforcement learning. In IRL, the reward is usually represented as a linear combination of features. In continuous state spaces, the state variables alone are not sufficiently rich to be used as features, but which features are good is not known in general. To address this issue, we propose a method that employs polynomial basis functions to form a candidate set of features, which are shown to allow the matching of statistical moments of state distributions. Feature selection is then performed for the candidates by leveraging the correlation between trajectory probabilities and feature expectations. We demonstrate the approach's effectiveness by recovering reward functions that capture expert policies across non-linear control tasks of increasing complexity. Code, data, and videos are available at https://sites.google.com/view/feature4irl.


Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem

Nam, Yunbi, Han, Sunwoo

arXiv.org Machine Learning

Random Forest is a machine learning method that offers many advantages, including the ability to easily measure variable importance. Class balancing technique is a well-known solution to deal with class imbalance problem. However, it has not been actively studied on RF variable importance. In this paper, we study the effect of class balancing on RF variable importance. Our simulation results show that over-sampling is effective in correctly measuring variable importance in class imbalanced situations with small sample size, while under-sampling fails to differentiate important and non-informative variables. We then propose a variable selection algorithm that utilizes RF variable importance and its confidence interval. Through an experimental study using many real and artificial datasets, we demonstrate that our proposed algorithm efficiently selects an optimal feature set, leading to improved prediction performance in class imbalance problem.


OpenFE: Automated Feature Generation with Expert-level Performance

Zhang, Tianping, Zhang, Zheyu, Fan, Zhiyuan, Luo, Haoyan, Liu, Fengyuan, Liu, Qian, Cao, Wei, Li, Jian

arXiv.org Artificial Intelligence

The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify effective features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves high efficiency and accuracy with two components: 1) a novel feature boosting method for accurately evaluating the incremental performance of candidate features and 2) a two-stage pruning algorithm that performs feature pruning in a coarse-to-fine manner. Extensive experiments on ten benchmark datasets show that OpenFE outperforms existing baseline methods by a large margin. We further evaluate OpenFE in two Kaggle competitions with thousands of data science teams participating. In the two competitions, features generated by OpenFE with a simple baseline model can beat 99.3% and 99.6% data science teams respectively. In addition to the empirical results, we provide a theoretical perspective to show that feature generation can be beneficial in a simple yet representative setting. The code is available at https://github.com/ZhangTP1996/OpenFE.